Method and system for concurrent use of two or more closely coupled communication recognition modalities

ABSTRACT

A method and system are provided in which a speech recognition system and one or more other input modalities are run in parallel. The system is especially useful for easing Chinese-language input, which is hindered by the difficulty of adapting Chinese ideograms to typical processing system input devices. The system allows a user to concurrently input information using different modalities. For example, a user may input spoken words and written characters (or data through other modalities). Each modality produces a list of possible words and the probability of each being the input. If the most probable word is the same for both modalities, it is selected as the input. If not, an acoustic profile of the most probable speech input, or an average acoustic profile of the possible speech inputs, is used to validate a possible word of the other modalities or to present a list of more probable possibilities to the user. The user may modify the written characters (or other input) from the provided possibilities. In this way the differing input modalities do not interfere with each other, but instead complement each other.

FIELD OF THE INVENTION

The present invention relates generally to a system for the translation of human input to electronic data, and more specifically to the concurrent use of two or more closely coupled input modalities to increase the accuracy and robustness of such a system while making the system easier to use.

BACKGROUND

Speech recognition systems and handwriting recognition systems for the translation of human input to electronic data are currently being developed.

One type of speech recognition system is the command and control type. The command and control type uses a grammar to format the input for more accurate recognition. A grammar may constrain the input to the system in some way so as to reduce the number of possible inputs. For example, the input may be constrained to be an “action” followed by a “target” of the action. If the user inputs the spoken words “open a file”, then “open” would be recognized from a group of action words and “file” would be recognized from a group of target words. The constraints placed upon the system by a grammar are language specific and can yield high recognition accuracy.
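By way of illustration only, an “action” followed by “target” grammar of the kind described above may be sketched as follows. The vocabularies, the filler words, and the function name parse_command are assumptions made for this sketch; only “open” and “file” come from the example above.

```python
# Hypothetical vocabularies; only "open" and "file" appear in the text above.
ACTIONS = {"open", "close", "save"}
TARGETS = {"file", "window", "folder"}
FILLERS = {"a", "an", "the"}  # assumed: articles are ignored by the grammar


def parse_command(utterance):
    """Return (action, target) if the utterance fits the grammar, else None."""
    words = [w for w in utterance.lower().split() if w not in FILLERS]
    if len(words) == 2 and words[0] in ACTIONS and words[1] in TARGETS:
        return words[0], words[1]
    return None
```

Because each position accepts only words from a small group, an utterance such as “file open” is rejected rather than misrecognized, which is how the constraint can reduce the number of possible inputs.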

A second type of speech recognition system is the dictation type. The dictation type of speech recognition is not constrained by a specific grammar, but by less stringent language modeling. Whatever input is spoken, the system will attempt to recognize the input word by word using statistical information. This system is more flexible, but yields lower recognition accuracy. The accuracy level of the dictation type of speech recognition system is currently too low to provide a generally practical system.

These two types of speech recognition systems have their counterparts in the handwriting recognition arena. The “pen gesture” type of handwriting recognition system is analogous to the command and control type of speech recognition system in that the input is constrained by a formatted structure known as a template. The less structured handwriting recognition system is simply known as handwriting and attempts to recognize handwritten input at the letter or word level without a required format.

It has been recognized that using both a speech recognition system and a handwriting recognition system in tandem could significantly improve the speed and accuracy of a translation system. The coupling of the constrained type of each system (i.e., the command and control type of speech recognition system and the pen gesture type of handwriting system) has been possible because both use a constraint system. The command and control type of speech recognition system uses a grammar and the pen gesture type of handwriting recognition system uses a template. The grammar and the template function similarly in their respective systems. The coupling of the two modalities has yielded improved system performance.

A more flexible and versatile system would be the coupling of the least constrained types of each system (i.e., the dictation type of speech recognition system and the handwriting type of handwriting recognition system).

An accurate, robust, and easy to use translation system for human input is especially important for Chinese language users. The Chinese language is made up of tens of thousands of pictographic words known as ideograms that are combined to create other words. One common way to input information to a processing system is through the use of Pinyin. Pinyin is a system for transforming Chinese ideograms into Roman alphabet based words. For many people who use Chinese as their native language, it is an arduous task to input information into a processing system. This task discourages many from accessing the many devices that rely on human input to a processing system. Whereas in the English language many people can type words faster than they can speak them, this is much more difficult when the words must be initially translated from Chinese ideograms.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example and not limited by the figures of the accompanying drawings, in which like references indicate similar elements, and in which:

FIG. 1 is a diagram illustrating an exemplary digital processing system for implementing the present invention;

FIGS. 2A and 2B depict a process flow diagram of one embodiment of the present invention; and

FIG. 3 shows a simplified example of system output according to one embodiment.

DETAILED DESCRIPTION

1.1 Overview

According to one aspect of the present invention, a method and system are provided in which a speech recognition system and one or more other input modalities are run in parallel. The system allows a user to concurrently input information using different modalities. For example, a user may input spoken words and one or more other types of input, including written characters (e.g., Chinese ideograms or pinyin) and body movements such as sign language, among others.

The differing input modalities do not interfere with each other, but instead complement each other. The system cross-references and indexes the output from the individual input modalities to provide a faster, more accurate, and more robust translation system. An acoustic profile of the most probable speech input, or an average acoustic profile of the possible speech inputs, is used to validate the input through other modalities.

FIG. 1 is a diagram illustrating an exemplary digital processing system 100 for implementing the present invention. The input recognition and data processing techniques described herein can be implemented and utilized within digital processing system 100, which can represent a general-purpose computer, portable computer, hand-held electronic device, or other like device. The components of digital processing system 100 are exemplary; one or more components can be omitted or added. For example, one or more memory devices can be utilized for digital processing system 100.

Referring to FIG. 1, digital processing system 100 includes a central processing unit 102 and a signal processor 103 coupled to a display circuit 105, main memory 104, static memory 106, and mass storage device 107 via bus 101. Digital processing system 100 can also be coupled to a display 121, keypad input 122, cursor control 123, hard copy device 124, input/output (I/O) devices 125, and audio/speech device 126 via bus 101.

Bus 101 is a standard system bus for communicating information and signals. CPU 102 and signal processor 103 are processing units for digital processing system 100. CPU 102 or signal processor 103 or both can be used to process information and/or signals for digital processing system 100. Signal processor 103 can be used to process speech or audio information and signals for speech processing and recognition. Alternatively, CPU 102 can be used to process speech or audio information and signals for speech processing or recognition. CPU 102 includes a control unit 131, an arithmetic logic unit (ALU) 132, and several registers 133, which are used to process information and signals. Signal processor 103 can also include components similar to those of CPU 102.

Main memory 104 can be, e.g., a random access memory (RAM) or some other dynamic storage device, for storing information or instructions (program code), which are used by CPU 102 or signal processor 103. For example, main memory 104 may store speech or audio information and instructions to be executed by signal processor 103 to process the speech or audio information. Main memory 104 may also store temporary variables or other intermediate information during execution of instructions by CPU 102 or signal processor 103. Static memory 106 can be, e.g., a read only memory (ROM) and/or other static storage devices, for storing information or instructions, which can also be used by CPU 102 or signal processor 103. Mass storage device 107 can be, e.g., a hard or floppy disk drive or optical disk drive, for storing information or instructions for digital processing system 100.

Display 121 can be, e.g., a cathode ray tube (CRT) or liquid crystal display (LCD). Display 121 displays information or graphics to a user. Digital processing system 100 can interface with display 121 via display circuit 105. Keypad input 122 is an alphanumeric input device with an analog to digital converter, for capturing handwritten input in an analog form and transforming the handwritten data into digital form, which can be used by signal processor 103 and/or CPU 102 for handwriting recognition and for communicating information and command selections to digital processing system 100. Cursor control 123 can be, e.g., a mouse, a trackball, or cursor direction keys, for controlling movement of an object on display 121. Hard copy device 124 can be, e.g., a laser printer, for printing information on paper, film, or some other like medium. A number of input/output devices 125 can be coupled to digital processing system 100. For example, a video camera can be coupled to digital processing system 100, through which video input could be received by signal processor 103 and/or CPU 102. Audio/speech device 126 can be, e.g., a microphone with an analog to digital converter, for capturing sounds of speech in an analog form and transforming the sounds into digital form, which can be used by signal processor 103 and/or CPU 102 for speech processing or recognition.

The speech processing techniques described herein can be implemented by hardware and/or software contained within digital processing system 100. For example, CPU 102 or signal processor 103 can execute code or instructions stored in a machine-readable medium, e.g., main memory 104, to process or to recognize speech.

The machine-readable medium may include a mechanism that provides (i.e., stores and/or transmits) information in a form readable by a machine such as a computer or digital processing device. For example, a machine-readable medium may include a read only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, and flash memory devices. The code or instructions can be represented by carrier wave signals, infrared signals, digital signals, and by other like signals.

FIGS. 2A and 2B depict a process flow diagram of one embodiment of the present invention. The process 200 begins in FIG. 2A at operation 205, in which the system receives input through more than one modality. The process 200 shows a handwriting recognition system (HRS) coupled to a speech recognition system (SRS). Other input modalities (e.g., sign language) may be added in alternative embodiments.

At operation 210a the HRS produces n-best possibilities for the input handwriting, and at operation 210b the SRS produces n-best possibilities for the input speech. For example, for a handwritten input of “rat” the HRS might produce an n-best input pool of (rat, hat, not, hot). For a spoken input of “rat” the SRS might produce an n-best input pool of (rat, rad, rut, writ). The n-best possibilities for each modality are ordered based on a probability assigned by the particular recognition system.
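As a minimal sketch of how such ordered pools might be represented, each recognizer's scores can be sorted by probability and truncated to the n best. The probability values and the helper name n_best are assumptions made for this illustration; only the candidate words come from the example above.

```python
def n_best(scores, n=4):
    """Return the n highest-scoring (word, probability) pairs, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]


# Illustrative (assumed) recognizer scores; "rot" is an extra low-scoring
# candidate added only to show the truncation to n entries.
hrs_scores = {"hot": 0.10, "rat": 0.40, "not": 0.15, "rot": 0.05, "hat": 0.30}
srs_scores = {"rut": 0.15, "rat": 0.50, "writ": 0.10, "rad": 0.25}

hrs_pool = n_best(hrs_scores)  # handwriting input pool
srs_pool = n_best(srs_scores)  # speech input pool
```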

At operations 215a and 215b, a meta-feature structure is created for each of the n-best input pools for each input word from the HRS modality and from the SRS modality. The meta-feature structure takes the following form.

meta-feature structure = (Probability, Start timestamp, {Content}, End timestamp)

The meta-feature structure includes the input pool word, labeled “Content”, together with the system-determined probability that the “Content” is the word the user input. Also included in the meta-feature structure are a start timestamp and an end timestamp for the particular input. At operation 220 the timestamps from the meta-feature structure of each modality are compared. Typically there may be some delay between corresponding written and spoken input. At operation 225 the system uses the start and end timestamps of each modality to determine if the durations overlap within some predetermined time differential. At operation 225a there is no overlap of the input duration time for each modality. This indicates that the user input data using only one modality (i.e., the user either spoke or wrote, but not both at the same time). In this simplest case the system outputs the “Content” feature of the meta-feature structure having the highest probability for the modality that was used.
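By way of a non-limiting illustration, the meta-feature structure and the timestamp comparison of operations 220 and 225 may be sketched as follows. The class name MetaFeature, the function names, the use of seconds as the time unit, and the 0.5-second tolerance are assumptions made for this sketch and are not taken from the figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetaFeature:
    probability: float  # recognizer's probability that `content` is the input
    start: float        # start timestamp, in seconds (assumed unit)
    content: str        # candidate word from the input pool
    end: float          # end timestamp, in seconds (assumed unit)


def overlaps(a, b, tolerance=0.5):
    """True if the two input durations overlap, allowing `tolerance` seconds
    of delay between the modalities (the predetermined time differential)."""
    return a.start <= b.end + tolerance and b.start <= a.end + tolerance


def best(pool):
    """Single-modality case: return the entry with the highest probability."""
    return max(pool, key=lambda m: m.probability)
```

With these assumed values, a spoken input spanning 0.0 to 0.6 seconds and a written input spanning 0.9 to 2.0 seconds are treated as concurrent, whereas a written input beginning at 5.0 seconds is treated as a separate, single-modality input.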

At operation 225b the input duration does overlap for two or more modalities. This means the user input data using more than one modality. The system checks the “Content” feature having the highest probability for each of the modalities that were used. At operation 230 the system determines if the most likely output from each modality is the same. If so, the system outputs that “Content” feature (i.e., the “Content” feature having the highest probability in each modality used) at operation 230a. In this case the system has used each of the modalities as a redundant check upon the other(s).
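The agreement check of operation 230 may be sketched as follows, assuming each input pool is a list of (word, probability) pairs keyed by modality name. The function name agree_on_top and the data layout are hypothetical.

```python
def agree_on_top(pools):
    """Return the word that is most probable in every pool, or None.

    `pools` maps a modality name to a list of (word, probability) pairs.
    """
    tops = {max(pool, key=lambda c: c[1])[0] for pool in pools.values()}
    return tops.pop() if len(tops) == 1 else None
```

A return value of None corresponds to the case in which the modalities disagree and further comparison is needed.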

At operation 230b the highest probability “Content” features of the modalities are not the same. This means, for example, that the HRS has determined that the user input is most likely a certain word, and the SRS has determined that the user input is most likely some other word. The system will then develop an acoustic profile for the n-best possibilities of the SRS and the n-best possibilities of the HRS. At operation 235 the system then uses a Gaussian distribution to determine if there is a match between the acoustic profile of the highest probability “Content” feature from the SRS input pool and the acoustic profile of any of the n-best possibilities from the HRS input pool. For example, for a spoken input of “rat” the SRS might produce a highest probability “Content” feature of rat. For the handwritten input of “rat” the HRS might produce an n-best input pool of (hat, not, rat, hot). This happens because the letter “r”, when written, may appear similar to the letters “h” and “n”, and the letter “a” may appear similar to “o”. When handwritten, “hat” may appear similar to “rat”, but their acoustic profiles may not appear similar at all. Therefore, in the above example the acoustic profile of the highest probability “Content” feature from the SRS input pool will match the acoustic profile of a less probable possibility in the HRS input pool. Because it is the only one that matches the acoustic profile, it will be selected as the correct output.
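The following is a sketch of one way such a comparison could be carried out; it is not asserted to be the comparison used by the described embodiment. It assumes that an acoustic profile can be represented as a fixed-length feature vector and that a match is declared when an isotropic Gaussian similarity exceeds a threshold. The profile values, sigma, the threshold, and all names are made up for illustration.

```python
import math

# Toy acoustic profiles: made-up 3-dimensional feature vectors (assumption).
PROFILES = {
    "rat": (1.0, 0.2, 0.9),
    "hat": (0.1, 0.2, 0.9),
    "not": (0.5, 0.9, 0.1),
    "hot": (0.1, 0.9, 0.1),
}


def gaussian_similarity(a, b, sigma=0.1):
    """Unnormalized isotropic Gaussian kernel: 1.0 for identical profiles,
    approaching 0.0 as the profiles move apart."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def match_speech_top(srs_top, hrs_pool, min_similarity=0.5):
    """Return the HRS candidates whose profile matches the top SRS word."""
    target = PROFILES[srs_top]
    return [w for w in hrs_pool
            if gaussian_similarity(target, PROFILES[w]) >= min_similarity]
```

Under these assumed profiles, calling match_speech_top("rat", ["hat", "not", "rat", "hot"]) leaves only “rat”, even though “rat” is ordered third in the handwriting pool.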

This method has a high degree of accuracy when translating Chinese ideograms because similar ideograms generally have very different acoustic profiles. When an acoustic profile is made of the highest probability “Content” feature from the SRS input pool, it will likely match at most one of the pinyin-based n-best possibilities from the HRS input pool. FIG. 3 shows a simplified example of system output according to one embodiment. In example 300 shown in FIG. 3, input 302 to the SRS is the spoken word “mao” and input 304 is the corresponding handwritten Chinese ideogram. The resultant SRS n-best possibilities 308 for the SRS input 302 are very close in pronunciation to the spoken input 302. As shown, “mao” is among the possibilities and has the highest probability at 81%; therefore, an acoustic profile will be made for “mao”.

The resultant HRS n-best possibilities 312 for the HRS input 304 are very similar-looking Chinese ideograms. The ideogram corresponding to “mao” is among the possibilities and has the second highest probability at 87%. However, words 314 corresponding to the Chinese ideograms would not have a similar pronunciation (i.e., acoustic profile).

Referring to FIG. 2A, the acoustic profiles are compared at operation 240, and if a match is found it is output at operation 240a. For example, using the data of FIG. 3, “yia”, when depicted as a Chinese ideogram, had the highest probability of being the input word. However, the acoustic profile of “yia” would not match the acoustic profile of the highest probability content feature from the SRS (i.e., “mao”). Only one of the HRS n-best possibilities yields an acoustic profile that matches the acoustic profile of the highest probability content feature from the SRS, namely the ideogram corresponding to “mao”, which is therefore selected. It may be, however, that none of the acoustic profiles of the n-best possibilities of the HRS input pool matches the acoustic profile of the highest probability content feature from the SRS.

Process 200 continues in FIG. 2B. At operation 240b, shown in FIG. 2B, the acoustic profile of the highest probability “Content” feature from the SRS input pool does not match any of the acoustic profiles of the n-best possibilities of the HRS input pool. The system then develops an average acoustic profile from the acoustic profiles of the n-best possibilities of the SRS input pool. At operation 245 the acoustic profiles for the n-best possibilities of the SRS and the n-best possibilities of the HRS are combined in one input pool. The system then compares the average acoustic profile of the SRS input pool to all of the acoustic profiles in the combined input pool. At operation 250 the difference is calculated between the average acoustic profile of the n-best possibilities of the SRS input pool and each of the acoustic profiles in the combined input pool. At operation 255 the difference is used to reorder the possibilities of the combined input pool. That is, the possible input of the combined input pool having an acoustic profile with the least difference from the average acoustic profile of the SRS input pool will now be ordered highest (i.e., deemed to be most probable). The other possibilities are likewise ordered. At operation 256 the highest ordered possibility produced by the HRS is selected if the calculated difference between the acoustic profile of this possibility and the average acoustic profile is below a specified threshold. If the difference in acoustic profiles is below the specified threshold, this possibility may be selected even though possibilities produced by the SRS may be ordered higher (i.e., have acoustic profiles that differ from the average acoustic profile by a smaller amount). It may be that the calculated difference between the average acoustic profile and the acoustic profile of the highest ordered possibility produced by the HRS is above the specified threshold. In this case the highest ordered possibility produced by the HRS may not be selected.
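Operations 245 through 256 may be sketched as follows, again assuming that acoustic profiles are fixed-length feature vectors and that the “difference” is a Euclidean distance. Neither assumption is stated in the description, and the function names and data layout are hypothetical.

```python
import math


def average_profile(profiles):
    """Component-wise mean of equal-length feature vectors."""
    n = len(profiles)
    return tuple(sum(col) / n for col in zip(*profiles))


def rerank(srs_pool, hrs_pool, profiles, threshold):
    """Return (reordered combined pool, selected HRS word or None).

    `srs_pool` and `hrs_pool` are lists of words; `profiles` maps each word
    to its (assumed) acoustic feature vector.
    """
    avg = average_profile([profiles[w] for w in srs_pool])
    combined = [(w, "SRS") for w in srs_pool] + [(w, "HRS") for w in hrs_pool]
    # Order the combined pool by distance from the average SRS profile.
    scored = sorted(
        ((math.dist(avg, profiles[w]), w, src) for w, src in combined),
        key=lambda t: t[0],
    )
    # Highest ordered possibility that was produced by the HRS, if any.
    best_hrs = next(((d, w) for d, w, src in scored if src == "HRS"), None)
    selected = best_hrs[1] if best_hrs and best_hrs[0] < threshold else None
    return [(w, src) for _, w, src in scored], selected
```

When no handwriting possibility falls below the threshold, the function returns None for the selection, and the reordered pool remains available for presentation to the user.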

At operation 260 the reordered possibilities from the combined input pool are presented to the user to choose the input. Alternatively, only the most probable portion of the reordered possibilities is presented. The user may doodle pen gesture commands to simultaneously alter the input character from the possibilities presented. In an embodiment using Chinese ideograms as handwritten input, the user may write pinyin corresponding to the ideograms in order to acoustically refine the possible selection.

In the foregoing specification the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

1. A method comprising: receiving input through a plurality of input modalities; ascertaining an input pool for the input of each modality, the input pools containing possible words for the input, the possible words of each pool ordered based upon a probability that a possible word is the input; and selecting a highest probability possible word if the input pools for each modality contain the same word as the highest probability possible word.
 2. The method of claim 1, wherein an input modality is speech.
 3. The method of claim 2, wherein an input modality is handwriting.
 4. The method of claim 3, wherein handwriting is Chinese ideograms.
 5. The method of claim 3, wherein the possible words are written in pinyin.
 6. The method of claim 5, further comprising: developing an acoustic profile of the highest probability possible word of the speech input pool; developing an acoustic profile for each of the possible words of the handwriting input pool; and selecting the highest probability possible word of the speech input pool if the highest probability possible word of the speech input pool matches an acoustic profile of a possible word of the handwriting input pool.
 7. The method of claim 5, further comprising: developing an average acoustic profile for the possible words of the speech input pool; developing an acoustic profile for each of the possible words of the handwriting input pool and the speech input pool to create a combined input pool; calculating a difference between the average acoustic profile and each of the acoustic profiles of the combined input pool; reordering the possible words of the combined input pool based upon the difference, such that the possible word of the combined input pool having an acoustic profile with a least difference from the average acoustic profile is ordered first; selecting the handwriting input pool possibility, from the combined pool, having the least difference if the difference is below a specified threshold; displaying the reordered possible words of the combined input pool; and receiving user modification to the handwriting input.
 8. An apparatus comprising: means for receiving input through a plurality of input modalities; means for ascertaining an input pool for the input of each modality, the input pools containing possible words for the input, the possible words of each pool ordered based upon a probability that a possible word is the input; and means for selecting a highest probability possible word if the input pools for each modality contain the same word as the highest probability possible word.
 9. The apparatus of claim 8, wherein an input modality is speech.
 10. The apparatus of claim 9, wherein an input modality is handwriting.
 11. The apparatus of claim 10, wherein handwriting is Chinese ideograms.
 12. The apparatus of claim 10, wherein the possible words are written in pinyin.
 13. The apparatus of claim 12, further comprising: means for developing an acoustic profile of the highest probability possible word of the speech input pool; means for developing an acoustic profile for each of the possible words of the handwriting input pool; and means for selecting the highest probability possible word of the speech input pool if the highest probability possible word of the speech input pool matches an acoustic profile of a possible word of the handwriting input pool.
 14. The apparatus of claim 12, further comprising: means for developing an average acoustic profile for the possible words of the speech input pool; means for developing an acoustic profile for each of the possible words of the handwriting input pool and the speech input pool to create a combined input pool; means for calculating a difference between the average acoustic profile and each of the acoustic profiles of the combined input pool; means for reordering the possible words of the combined input pool based upon the difference, such that the possible word of the combined input pool having an acoustic profile with a least difference from the average acoustic profile is ordered first; means for selecting the handwriting input pool possibility, from the combined pool, having the least difference if the difference is below a specified threshold; means for displaying the reordered possible words of the combined input pool; and means for receiving user modification to the handwriting input.
 15. A machine-readable medium that provides executable instructions, which when executed by a digital processing system, cause the digital processing system to perform a method comprising: receiving input through a plurality of input modalities; ascertaining an input pool for the input of each modality, the input pools containing possible words for the input, the possible words of each pool ordered based upon a probability that a possible word is the input; and selecting a highest probability possible word if the input pools for each modality contain the same word as the highest probability possible word.
 16. The machine-readable medium of claim 15, wherein an input modality is speech.
 17. The machine-readable medium of claim 16, wherein an input modality is handwriting.
 18. The machine-readable medium of claim 17, wherein handwriting is Chinese ideograms.
 19. The machine-readable medium of claim 17, wherein the possible words are written in pinyin.
 20. The machine-readable medium of claim 19, further comprising: developing an acoustic profile of the highest probability possible word of the speech input pool; developing an acoustic profile for each of the possible words of the handwriting input pool; and selecting the highest probability possible word of the speech input pool if the highest probability possible word of the speech input pool matches an acoustic profile of a possible word of the handwriting input pool.
 21. The machine-readable medium of claim 19, further comprising: developing an average acoustic profile for the possible words of the speech input pool; developing an acoustic profile for each of the possible words of the handwriting input pool and the speech input pool to create a combined input pool; calculating a difference between the average acoustic profile and each of the acoustic profiles of the combined input pool; reordering the possible words of the combined input pool based upon the difference, such that the possible word of the combined input pool having an acoustic profile with a least difference from the average acoustic profile is ordered first; selecting the handwriting input pool possibility, from the combined pool, having the least difference if the difference is below a specified threshold; displaying the reordered possible words of the combined input pool; and receiving user modification to the handwriting input.